A Probabilistic Hierarchical Clustering Method for Organising Collections of Text Documents
Abstract
This paper proposes a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale, sparse, high-dimensional data collections. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis, termed symmetric and asymmetric models. For text data specifically, both asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. An Expectation Maximisation parameter estimation method is provided for all of these models. The models are compared experimentally on two extensive online document collections.
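As a rough illustration of the Expectation Maximisation estimation mentioned in the abstract, the following is a minimal sketch for a flat (non-hierarchical) mixture of multinomials over word counts. It is not the paper's hierarchical symmetric or asymmetric model. The function name `em_multinomial_mixture`, the `smoothing` constant, and the random soft initialisation are all assumptions made for this example.

```python
import math
import random


def em_multinomial_mixture(counts, n_clusters, n_iters=50, seed=0, smoothing=1e-2):
    """Fit a flat mixture of multinomials to word-count vectors with EM.

    counts: list of equal-length lists of non-negative word counts.
    Returns (mixing_weights, word_probs, responsibilities).
    Illustrative sketch only; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    # Random soft initial assignment of documents to clusters.
    resp = []
    for _ in range(n_docs):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iters):
        # M-step: re-estimate mixing weights and per-cluster word
        # distributions from the current responsibilities. The smoothing
        # term keeps every probability strictly positive.
        weights = [
            (smoothing + sum(resp[d][k] for d in range(n_docs)))
            / (n_docs + n_clusters * smoothing)
            for k in range(n_clusters)
        ]
        word_probs = []
        for k in range(n_clusters):
            expected = [
                smoothing + sum(resp[d][k] * counts[d][w] for d in range(n_docs))
                for w in range(n_words)
            ]
            total = sum(expected)
            word_probs.append([e / total for e in expected])

        # E-step: posterior cluster probabilities for each document,
        # computed in log space to avoid underflow on long documents.
        for d in range(n_docs):
            log_post = [
                math.log(weights[k])
                + sum(c * math.log(word_probs[k][w])
                      for w, c in enumerate(counts[d]) if c)
                for k in range(n_clusters)
            ]
            peak = max(log_post)
            unnorm = [math.exp(lp - peak) for lp in log_post]
            total = sum(unnorm)
            resp[d] = [u / total for u in unnorm]

    return weights, word_probs, resp
```

Calling `em_multinomial_mixture(counts, n_clusters=2)` on a small list of word-count rows returns the mixing weights, the per-cluster word distributions, and a row of cluster probabilities for each document. A hierarchical variant would need additional structure relating the clusters at different levels, which this sketch does not attempt.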
Similar resources
Probabilistic Hierarchical Clustering Method for Organizing Collections of Text Documents
In this paper a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale sparse high-dimensional data collections is proposed. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis and these have been termed as symmetric and asymmetric models. For text data specifically both asymmetric and sy...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering previously unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised grouping of similar documents, is one of the important techniques of text mining. The most important step...
Incremental and Hierarchical Document Clustering
Over the past few decades, the volume of text data has increased exponentially. Automatic tools for organizing these huge document collections are becoming increasingly important. Document clustering organizes documents into clusters automatically. Most clustering algorithms process document collections as a whole; however, it is important to process these docu...
Incremental Construction of Topic Hierarchies using Hierarchical Term Clustering
Topic hierarchies are very useful for managing, searching and browsing large repositories of text documents. Hierarchical clustering methods support the construction of topic hierarchies in an unsupervised way. However, traditional methods are ineffective for growing text collections. In this paper, an incremental method for the construction of topic hierarchies...
Self-Organising Maps for Hierarchical Tree View Document Clustering Using Contextual Information
In this paper we propose an effective method to cluster documents into a dynamically built taxonomy of topics, directly extracted from the documents. We take into account short contextual information within the text corpus, which is weighted by importance and used as input to a set of independently spun growing Self-Organising Maps (SOM). This work shows an increase in precision and labelling q...
Publication date: 2000